The data comes from Kaggle, a community of data scientists and data enthusiasts. I chose this dataset because there are many good kernels about it on Kaggle; thanks to that, I will have the chance to learn from many people.
1 - https://github.com/meli-lewis/pycaribbean2016/blob/master/pycaribbean.ipynb
2 - https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python
3 - https://seaborn.pydata.org/generated/seaborn.swarmplot.html
4 - http://seaborn.pydata.org/generated/seaborn.pairplot.html
5 - https://www.kaggle.com/tannercarbonati/detailed-data-analysis-ensemble-modeling
6 - https://www.kaggle.com/bsivavenu/house-price-calculation-methods-for-beginners
7 - https://seaborn.pydata.org/generated/seaborn.FacetGrid.html
import pandas as pd
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
train = pd.read_csv("/Users/yetkineser/Desktop/mef Python/final project/data/train.csv")
test = pd.read_csv("/Users/yetkineser/Desktop/mef Python/final project/data/test.csv")
# train = pd.read_csv("C:/Users/A46988/Desktop/housePrice/train.csv")
# test = pd.read_csv("C:/Users/A46988/Desktop/housePrice/test.csv")
# Show the first five rows of the train dataset
train.head()
# Show the first five rows of the test dataset
test.head()
Note: The difference between the train and test datasets is that train has one extra column, SalePrice, because this dataset is normally used to build a model for SalePrice. I use the training data for my analysis.
Training dataset: the "gold standard" data used to train your model by pairing each input with its expected output.
Test dataset: used to estimate how well your model has been trained and to estimate model properties (mean error for numeric predictors, classification error for classifiers, recall and precision for IR models, etc.); how useful it is depends on the size of your data, the value you would like to predict, the inputs, and so on.
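Since the test file here lacks SalePrice, one common alternative (just a sketch on made-up numbers, not part of this notebook's workflow) is to hold out a validation split from the training data itself:

```python
import pandas as pd

# Toy frame standing in for the real training data (values are invented).
df = pd.DataFrame({"GrLivArea": [850, 1200, 1600, 2100, 950],
                   "SalePrice": [120000, 155000, 210000, 280000, 130000]})

# Keep 80% of the rows for fitting, hold out the remaining 20% for validation.
train_part = df.sample(frac=0.8, random_state=42)
valid_part = df.drop(train_part.index)
```

The held-out rows never touch the model during fitting, so error measured on them approximates generalization error.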
# Show the number of rows and columns
train.shape
# Show Column names
train.columns
Note: .shape and .columns are attributes, not methods, so you don't need to follow these with parentheses ().
train.info()
You can find detailed description here.
# Show the count and percentage of missing data for each column that has missing values
total = train.isnull().sum().sort_values(ascending=False)
percent = (train.isnull().sum()/train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data[missing_data['Percent']>0]
# Drop columns (6 columns)
# train.drop(['PoolQC','MiscFeature','Alley','Fence','FireplaceQu','LotFrontage'],axis=1,inplace=True)
# Show rows and columns number after drop columns
# train.shape
# Check the dtypes of the columns that have null values
# print(train['MasVnrArea'].dtype)
# print(train['MasVnrType'].dtype)
# Fill null MasVnrArea values with the mean of the MasVnrArea column
# MasVnrArea_mean = train.MasVnrArea.mean()
# train['MasVnrArea'] = train.MasVnrArea.fillna(MasVnrArea_mean)
# train['MasVnrType'].value_counts(dropna=False)
# Fill null values with 'None', the most frequent value
# in the 'MasVnrType' column
# train['MasVnrType'] = train.MasVnrType.fillna('None')
# train['MasVnrType'].value_counts(dropna=False)
# train.dropna(inplace=True)
# Show the number of rows and columns after dropping rows with null values
# train.shape
PoolQC: Pool quality. Null means "No Pool".
MiscFeature: Miscellaneous feature not covered in other categories. Null means "None".
Alley: Type of alley access to the property. Null means "No alley access".
Fence: Fence quality. Null means "No Fence".
FireplaceQu: Fireplace quality. Null means "No Fireplace".
LotFrontage: Linear feet of street connected to the property. The data description says nothing about null values, but since there is also a LotArea column, we may be able to infer what a null LotFrontage means with the help of LotArea.
GarageCond: Garage condition. GarageType: Garage location. GarageYrBlt: Year the garage was built. GarageFinish: Interior finish of the garage. GarageQual: Garage quality.
For the garage columns, null means "No Garage". But I can't write "No Garage" into a numerical column (GarageYrBlt), so it will stay null.
For the basement columns, null means "No Basement".
There is no information about null values for the masonry veneer columns.
train['PoolQC'] = train.PoolQC.fillna("No Pool")
train['MiscFeature'] = train.MiscFeature.fillna("None")
train['Alley'] = train.Alley.fillna("No alley access")
train['Fence'] = train.Fence.fillna("No Fence")
train['FireplaceQu'] = train.FireplaceQu.fillna("No Fireplace")
train['GarageCond'] = train.GarageCond.fillna("No Garage")
train['GarageType'] = train.GarageType.fillna("No Garage")
train['GarageFinish'] = train.GarageFinish.fillna("No Garage")
train['GarageQual'] = train.GarageQual.fillna("No Garage")
train['BsmtExposure'] = train.BsmtExposure.fillna("No Basement")
train['BsmtFinType1'] = train.BsmtFinType1.fillna("No Basement")
train['BsmtFinType2'] = train.BsmtFinType2.fillna("No Basement")
train['BsmtCond'] = train.BsmtCond.fillna("No Basement")
train['BsmtQual'] = train.BsmtQual.fillna("No Basement")
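The repeated fillna calls above could also be written as a single call with a column-to-label dictionary; a minimal sketch on a toy frame (the values are made up):

```python
import pandas as pd

df = pd.DataFrame({"PoolQC": ["Ex", None],
                   "Fence": [None, "GdPrv"]})

# One fillna call: map each column to its "missing means absent" label.
fill_values = {"PoolQC": "No Pool", "Fence": "No Fence"}
df = df.fillna(value=fill_values)
```

This keeps the null-meaning decisions in one place, which is easier to audit than fourteen separate assignments.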
train[['LotFrontage','LotArea']].loc[pd.isnull(train['LotFrontage']) == True].head(12)
# train.groupby('Neighborhood')['LotFrontage'].median()
# neighbor = train[pd.isnull(train['LotFrontage']) == False].groupby('Neighborhood')['LotFrontage'].median()
# Change list to dataframe
# df1 = pd.DataFrame(data=neighbor.index, columns=['Neighborhood_1'])
# df2 = pd.DataFrame(data=neighbor.values, columns=['LotFrontage_1'])
# df = pd.merge(df1, df2, left_index=True, right_index=True)
#!pip install dfply
from dfply import *
train_2 = train[pd.isnull(train['LotFrontage']) == False]
Neighborhood = (train_2 >>
                group_by(X.Neighborhood) >>
                summarize(LotFrontage=X.LotFrontage.median()))
train_3 = (train >> inner_join(Neighborhood, by = 'Neighborhood'))
train_3
train_4 = train_3[pd.isnull(train_3['LotFrontage_x']) == True]
train_4
train_5 = (train_4 >> rename(LotFrontage=X.LotFrontage_y) >> select(X.Id, X.LotFrontage))
train_5
train_6 = (train_2 >> select(X.Id, X.LotFrontage))
train_6
train_7 = train_5 >> union(train_6)
train_8 = (train >> select(~X.LotFrontage) >> inner_join(train_7, by = 'Id'))
train = train_8
train[pd.isnull(train['LotFrontage']) == True]
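The dfply pipeline above can be reproduced in plain pandas with groupby/transform, which fills each null LotFrontage with the median of its neighborhood; a sketch on toy data (neighborhood names and frontages are invented):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Neighborhood": ["A", "A", "A", "B", "B"],
                   "LotFrontage": [60.0, 80.0, np.nan, 50.0, np.nan]})

# transform("median") returns a per-row series of each row's group median
# (NaNs are skipped when computing it), which fillna uses only where needed.
df["LotFrontage"] = df["LotFrontage"].fillna(
    df.groupby("Neighborhood")["LotFrontage"].transform("median"))
```

Because transform keeps the original index alignment, no joins or unions are needed.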
train.loc[pd.isnull(train['MasVnrArea']) == True].head(12)
numerical = [f for f in train.columns if train.dtypes[f] != 'object']
numerical.remove('SalePrice')
numerical.remove('Id')
train[numerical].describe()
from IPython.display import display, HTML
# Assuming that dataframes df1 and df2 are already defined:
# print("Dataframe 1:")
# display(df1)
# print("Dataframe 2:")
# HTML(df2.to_html())
categorical = [f for f in train.columns if train.dtypes[f] == 'object']
for col in categorical:
    count = train.groupby(col)['SalePrice'].count()
    mean = round(train.groupby(col)['SalePrice'].mean()/1000, 2)
    min_ = round(train.groupby(col)['SalePrice'].min()/1000, 2)
    q1 = round(train.groupby(col)['SalePrice'].quantile(.25)/1000, 2)
    median = round(train.groupby(col)['SalePrice'].median()/1000, 2)
    q3 = round(train.groupby(col)['SalePrice'].quantile(.75)/1000, 2)
    max_ = round(train.groupby(col)['SalePrice'].max()/1000, 2)
    new_df = pd.concat([count, mean, min_, q1, median, q3, max_], axis=1)
    new_df.columns = ['count', 'mean ($k)', 'min ($k)', 'Q1 ($k)', 'median ($k)', 'Q3 ($k)', 'max ($k)']
    display(new_df.sort_values(by=['count'], ascending=False))
sales_price = round(train['SalePrice']/1000,2)
round(sales_price.describe(),2)
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.distplot(train['SalePrice'], kde = True, color = 'orange', hist_kws={'alpha': 0.5})
import scipy.stats as st
y = train['SalePrice']
plt.figure(1); plt.title('Johnson SU')
sns.distplot(y, kde=False, fit=st.johnsonsu)
plt.figure(2); plt.title('Normal')
sns.distplot(y, kde=False, fit=st.norm)
plt.figure(3); plt.title('Log Normal')
sns.distplot(y, kde=False, fit=st.lognorm)
import matplotlib
import numpy as np
matplotlib.rcParams['figure.figsize'] = (15.0, 9.0)
prices = pd.DataFrame({"price":train["SalePrice"], "log(price + 1)":np.log1p(train["SalePrice"])})
prices.hist()
#skewness and kurtosis
print("Skewness: %f" % train['SalePrice'].skew())
print("Kurtosis: %f" % train['SalePrice'].kurt())
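The positive skew motivates the log(price + 1) panel above; on synthetic right-skewed data (a stand-in for SalePrice, generated here rather than taken from the dataset), log1p pulls the skewness back toward zero:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Lognormal-ish sample playing the role of a price column.
prices = pd.Series(np.expm1(rng.normal(12.0, 0.4, size=2000)))

skew_raw = prices.skew()            # clearly positive for right-skewed data
skew_log = np.log1p(prices).skew()  # much closer to zero after the transform
```

A more symmetric target generally suits models that assume roughly normal residuals.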
quantitative = [f for f in train.columns if train.dtypes[f] != 'object']
quantitative.remove('SalePrice')
quantitative.remove('Id')
f = pd.melt(train, value_vars=quantitative)
g = sns.FacetGrid(f, col="variable", col_wrap=3, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")
qualitative = [f for f in train.columns if train.dtypes[f] == 'object']
def scatter(x, y, **kwargs):
    plt.scatter(x=x, y=y)
    plt.xticks(rotation=90)
def pairplot(x, y, **kwargs):
    ax = plt.gca()
    ts = pd.DataFrame({'time': x, 'val': y})
    ts = ts.groupby('time').mean()
    ts.plot(ax=ax)
    plt.xticks(rotation=90)
f = pd.melt(train, id_vars=['SalePrice'], value_vars=quantitative)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, size=7)
g = g.map(scatter, "value", "SalePrice")
#g = g.map(pairplot, "value", "SalePrice")
train_2= train
train_2[['YrSold']] = train_2[['YrSold']].astype('object')
qualitative = [f for f in train_2.columns if train_2.dtypes[f] == 'object']
def boxplot(x, y, **kwargs):
    sns.boxplot(x=x, y=y, color="white", showfliers=False)  # hide outliers in the boxplot
    plt.xticks(rotation=90)
def swarmplot(x, y, **kwargs):
    sns.swarmplot(x=x, y=y, size=2)
    plt.xticks(rotation=90)
f = pd.melt(train, id_vars = ['SalePrice'], value_vars = qualitative)
g = sns.FacetGrid(f, col = "variable", col_wrap = 2, sharex = False, sharey = False, size = 8)
g = g.map(swarmplot, "value", "SalePrice")
g = g.map(boxplot, "value", "SalePrice")
#correlation matrix
corrmat = train.corr()
f, ax = plt.subplots(figsize=(20, 9))
sns.heatmap(corrmat, vmax=.8, annot=True);
# most correlated features
corrmat = train.corr()
top_corr_features = corrmat.index[abs(corrmat["SalePrice"])>0.6]
plt.figure(figsize=(10,10))
g = sns.heatmap(train[top_corr_features].corr(),annot=True,cmap="RdYlGn")
The numerical columns most correlated with SalePrice are:
OverallQual: Rates the overall material and finish of the house.
GrLivArea: Above grade (ground) living area square feet.
GarageCars: Size of garage in car capacity.
GarageArea: Size of garage in square feet.
TotalBsmtSF: Total square feet of basement area.
1stFlrSF: First Floor square feet.
sns.set()
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', '1stFlrSF']
sns.pairplot(train[cols], size = 2.5, diag_kind="kde")
plt.show();
def regplot(x, y, **kwargs):
    sns.regplot(x=x, y=y)
    plt.xticks(rotation=90)
f = pd.melt(train[cols], id_vars=['SalePrice'])# , value_vars=quantitative
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, size=7)
g = g.map(regplot, "value", "SalePrice")
dummies = pd.get_dummies(train[qualitative])
SalePrice = train['SalePrice']
frames = [SalePrice, dummies]
dummies_2 = pd.concat(frames, axis=1)
dummies_2.head()
corrmat = dummies_2.corr()
top_corr_features = corrmat.index[abs(corrmat["SalePrice"])>0.5]
plt.figure(figsize=(10,10))
g = sns.heatmap(dummies_2[top_corr_features].corr(),annot=True,cmap="RdYlGn")
The categorical columns most correlated with SalePrice are:
ExterQual: Evaluates the quality of the material on the exterior.
BsmtQual: Evaluates the height of the basement.
KitchenQual: Kitchen quality.
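These three columns share the same Ex/Gd/TA/Fa/Po quality scale, so instead of one dummy per level they could be encoded ordinally; a sketch where the numeric scores are my own choice, not something defined by the dataset:

```python
import pandas as pd

# Hypothetical ordinal scores: higher number = better quality.
quality_map = {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1}

sample = pd.Series(["Ex", "TA", "Gd"])   # toy stand-in for e.g. KitchenQual
encoded = sample.map(quality_map)
```

An ordinal encoding keeps the rank information (Ex > Gd > TA) that one-hot dummies throw away.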
def regplot(x, y, **kwargs):
    sns.regplot(x=x, y=y)
    plt.xticks(rotation=90)
f = pd.melt(dummies_2[top_corr_features], id_vars=['SalePrice'])# , value_vars=quantitative
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, size=7)
g = g.map(regplot, "value", "SalePrice")
train_2= train
train_2[['YrSold']] = train_2[['YrSold']].astype('object')
qualitative = [f for f in train_2.columns if train_2.dtypes[f] == 'object']
def boxplot(x, y, **kwargs):
    sns.boxplot(x=x, y=y, color="white", showfliers=False)  # hide outliers in the boxplot
    plt.xticks(rotation=90)
def swarmplot(x, y, **kwargs):
    sns.swarmplot(x=x, y=y, size=2)
    plt.xticks(rotation=90)
f = pd.melt(train, id_vars = ['SalePrice'], value_vars = ['ExterQual','BsmtQual','KitchenQual'])
g = sns.FacetGrid(f, col = "variable", col_wrap = 2, sharex = False, sharey = False, size = 8)
g = g.map(swarmplot, "value", "SalePrice")
g = g.map(boxplot, "value", "SalePrice")
- We can say that excellent quality in the kitchen, basement and exterior material affects the price noticeably.
sns.lmplot( x="TotalBsmtSF", y="SalePrice", data=train, fit_reg=False, hue="BsmtQual", legend=False)
g = sns.FacetGrid(train, col="BsmtQual",col_wrap = 2, size = 8)
g = g.map(plt.scatter, "TotalBsmtSF", "SalePrice", edgecolor="w")
- If Basement Quality is "Gd", "TA" or "Ex", there is a correlation between Sale Price and Total Basement Square Feet.
- We can add one more dimension to our scatter plots.
g = sns.FacetGrid(train, col="BsmtQual", hue="ExterQual",col_wrap = 2, size = 6)
g = g.map(plt.scatter, "TotalBsmtSF", "SalePrice", edgecolor="w").add_legend()